Bootstrapping Bilingual Lexicons from Comparable Corpora for Closely Related Languages
نویسندگان
چکیده
In this paper we present an approach to bootstrap a Croatian-Slovene bilingual lexicon from comparable news corpora from scratch, without relying on any external bilingual knowledge resource. Instead of using a dictionary to translate context vectors, we build a seed lexicon from identical words in both languages and extend it with context-based cognates and translation candidates of the most frequent words. By enlarging the seed dictionary for only 7% we were able to improve the baseline precision from 0.597 to 0.731 on the mean reciprocal rank for the ten top-ranking translation candidates with a 50.4% recall on the gold standard of 500 entries.
منابع مشابه
A Methodology for Bilingual Lexicon Extraction from Comparable Corpora
Dictionary extraction using parallel corpora is well established. However, for many language pairs parallel corpora are a scarce resource which is why in the current work we discuss methods for dictionary extraction from comparable corpora. Hereby the aim is to push the boundaries of current approaches, which typically utilize correlations between co-occurrence patterns across languages, in sev...
متن کاملConstraint-Based Bilingual Lexicon Induction for Closely Related Languages
The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction becomes a difficult task for low-resource languages. Pivot language and cognate recognition approach have been proven useful to induce bilingual lexicons for such languages. We analyze the features of closely related languages and define a semantic constraint assumption. Based on the assumption, we propose...
متن کاملBootstrapping Unsupervised Bilingual Lexicon Induction
The task of unsupervised lexicon induction is to find translation pairs across monolingual corpora. We develop a novel method that creates seed lexicons by identifying cognates in the vocabularies of related languages on the basis of their frequency and lexical similarity. We apply bidirectional bootstrapping to a method which learns a linear mapping between context-based vector spaces. Experim...
متن کاملExtraction de lexiques bilingues à partir de Wikipédia (Bilingual lexicon extraction from Wikipedia) [in French]
________________________________________________________________________________________________________ Bilingual lexicon extraction from Wikipedia With the increased interest of the machine translation, needs of multilingual resources such as comparable corpora and bilingual lexicon has increased. These resources are not available mainly for pair of languages that do not involve English. This...
متن کاملBilingual lexicon extraction from comparable corpora using in-domain terms
Many existing methods for bilingual lexicon learning from comparable corpora are based on similarity of context vectors. These methods suffer from noisy vectors that greatly affect their accuracy. We introduce a method for filtering this noise allowing highly accurate learning of bilingual lexicons. Our method is based on the notion of in-domain terms which can be thought of as the most importa...
متن کامل